NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games

Yang, Tong; Dai, Bo; Xiao, Lin; Chi, Yuejie (July 2025, PMLR)

Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample-efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However, existing sample-efficient approaches either require tailored uncertainty estimation under function approximation, or careful coordination of the players. In this paper, we propose a novel model-based algorithm, called VMG, that incentivizes exploration via biasing the empirical estimate of the model parameters towards those with a higher collective best-response values of all the players when fixing the other players’ policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to different forms of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we also establish that VMG achieves a near-optimal regret for finding both the NEs of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov games under linear function approximation in an online environment, which nearly match their counterparts with sophisticated uncertainty quantification.
more » « less
Free, publicly-accessible full text available July 13, 2026
Efficient Online Reinforcement Learning for Diffusion Policy

Ma, Haitong; Chen, Tianyi; Wang, Kai; Li, Na; Dai, Bo (July 2025, PMLR)

Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the conventional diffusion training procedure requires samples from target distribution, which is impossible in online RL since we cannot sample from the optimal policy. Backpropagating policy gradient through the diffusion process incurs huge computational costs and instability, thus being expensive and not scalable. To enable efficient training of diffusion policies in online RL, we generalize the conventional denoising score matching by reweighting the loss function. The resulting Reweighted Score Matching (RSM) preserves the optimal solution and low computational cost of denoising score matching, while eliminating the need to sample from the target distribution and allowing learning to optimize value functions. We introduce two tractable reweighted loss functions to solve two commonly used policy optimization problems, policy mirror descent and max-entropy policy, resulting in two practical algorithms named Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC). We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent diffusion-policy online RLs on most tasks, and the DPMD improves more than 120% over soft actor-critic on Humanoid and Ant.
more » « less
Free, publicly-accessible full text available July 13, 2026
Efficient Duple Perturbation Robustness in Low-rank MDPs

Hu, Yang; Ma, Haitong; Li, Na; Dai, Bo (June 2025, PMLR)
Ozay, Necmiye; Balzano, Laura; Panagou, Dimitra; Abate, Alessandro (Ed.)
The pursuit of robustness has recently been a popular topic in reinforcement learning (RL) research, yet the existing methods generally suffer from computation issues that obstruct their real-world implementation. In this paper, we consider MDPs with low-rank structures, where the transition kernel can be written as a linear product of feature map and factors. We introduce *duple perturbation* robustness, i.e. perturbation on both the feature map and the factors, via a novel characterization of (𝜉,𝜂) -ambiguity sets featuring computational efficiency. Our novel low-rank robust MDP formulation is compatible with the low-rank function representation view, and therefore, is naturally applicable to practical RL problems with large or even continuous state-action spaces. Meanwhile, it also gives rise to a provably efficient and practical algorithm with theoretical convergence rate guarantee. Lastly, the robustness of our proposed approach is justified by numerical experiments, including classical control tasks with continuous state-action spaces.
more » « less
Free, publicly-accessible full text available June 4, 2026
DF$^2$: Distribution-Free Decision-Focused Learning

Kong, Lingkai; Mu, Wenhao; Cui, Jiaming; Zhuang, Yuchen; Prakash, B Aditya; Dai, Bo; Zhang, Chao (July 2025, AUAI Press)

Free, publicly-accessible full text available July 21, 2026
Scalable spectral representations for multiagent reinforcement learning in network MDPs

Ren, Zhaolin; Zhang, Runyu; Dai, Bo; Li, Na (May 2025, PMLR)
Li, Yingzhen; Mandt, Stephan; Agrawal, Shipra; Khan, Emtiyaz (Ed.)
Free, publicly-accessible full text available May 3, 2026
Scalable spectral representations for multiagent reinforcement learning in network MDPs

Ren, Zhaolin; Zhang, Runyu; Dai, Bo; Li, Na (May 2025, Proceedings of Machine Learning Research)
Li, Yingzhen; Mandt, Stephan; Agrawal, Shipra; Khan, Emtiyaz (Ed.)
Network Markov Decision Processes (MDPs), which are the de-facto model for multi-agent control, pose a significant challenge to efficient learning caused by the exponential growth of the global state-action space with the number of agents. In this work, utilizing the exponential decay property of network dynamics, we first derive scalable spectral local representations for multiagent reinforcement learning in network MDPs, which induces a network linear subspace for the local $$Q$$-function of each agent. Building on these local spectral representations, we design a scalable algorithmic framework for multiagent reinforcement learning in continuous state-action network MDPs, and provide end-to-end guarantees for the convergence of our algorithm. Empirically, we validate the effectiveness of our scalable representation-based approach on two benchmark problems, and demonstrate the advantages of our approach over generic function approximation approaches to representing the local $$Q$$-functions.
more » « less
Free, publicly-accessible full text available May 3, 2026
Primal-Dual Spectral Representation for Off-policy Evaluation

Hu, Yang; Chen, Tianyi; Li, Na; Wang, Kai; Dai, Bo (May 2025, PMLR)
Li, Yingzhen; Mandt, Stephan; Agrawal, Shipra; Khan, Emtiyaz (Ed.)
Off-policy evaluation (OPE) is one of the most fundamental problems in reinforcement learning (RL) to estimate the expected long-term payoff of a given target policy with \emph{only} experiences from another behavior policy that is potentially unknown. The distribution correction estimation (DICE) family of estimators have advanced the state of the art in OPE by breaking the \emph{curse of horizon}. However, the major bottleneck of applying DICE estimators lies in the difficulty of solving the saddle-point optimization involved, especially with neural network implementations. In this paper, we tackle this challenge by establishing a \emph{linear representation} of value function and stationary distribution correction ratio, \emph{i.e.}, primal and dual variables in the DICE framework, using the spectral decomposition of the transition operator. Such primal-dual representation not only bypasses the non-convex non-concave optimization in vanilla DICE, therefore enabling an computational efficient algorithm, but also paves the way for more efficient utilization of historical data. We highlight that our algorithm, \textbf{SpectralDICE}, is the first to leverage the linear representation of primal-dual variables that is both computation and sample efficient, the performance of which is supported by a rigorous theoretical sample complexity guarantee and a thorough empirical evaluation on various benchmarks.
more » « less
Free, publicly-accessible full text available May 3, 2026
Spectral Representation for Causal Estimation with Hidden Confounders

Sun, Haotian; Moulin, Antoine; Ren, Tongzheng; Gretton, Arthur; Dai, Bo (May 2025, PMLR)
Li, Yingzhen; Mandt, Stephan; Agrawal, Shipra; Khan, Emtiyaz (Ed.)
We study the problem of causal effect estimation in the presence of unobserved confounders, focusing on two settings: instrumental variable (IV) regression with additional observed confounders, and proxy causal learning. Our approach uses a singular value decomposition of a conditional expectation operator combined with a saddle-point optimization method. In the IV regression setting, this can be viewed as a neural network generalization of the seminal approach due to Darolles et al. (2011). Saddle-point formulations have recently gained attention because they mitigate the double-sampling bias and are compatible with modern function approximation methods. We provide experimental validation across various settings and show that our approach outperforms existing methods on common benchmarks.
more » « less
Free, publicly-accessible full text available May 3, 2026
Skill Transfer and Discovery for Sim-to-Real Learning: A Representation-Based Viewpoint

https://doi.org/10.1109/IROS58592.2024.10801637

Ma, Haitong; Ren, Zhaolin; Dai, Bo; Li, Na (October 2024, IEEE)

Full Text Available
HYDRA: Model Factorization Framework for Black-Box LLM Personalization

Zhuang, Yuchen; Sun, Haotian; Yu, Yue; Qiang, Rushi; Wang, Qifan; Zhang, Chao; Dai, Bo (December 2024, Curran Associates, Inc.)

Personalization has emerged as a critical research area in modern intelligent systems, focusing on mining users' behavioral history and adapting to their preferences for delivering tailored experiences. Despite the remarkable few-shot capabilities exhibited by black-box large language models (LLMs), the inherent opacity of their model parameters presents significant challenges in aligning the generated output with individual expectations. Existing solutions have primarily focused on prompt design to incorporate user-specific profiles and behaviors; however, such approaches often struggle to generalize effectively due to their inability to capture shared knowledge among all users. To address these challenges, we propose HYDRA, a model factorization framework that captures both user-specific behavior patterns from historical data and shared general knowledge among all users to deliver personalized generation. In order to capture user-specific behavior patterns, we first train a reranker to prioritize the most useful information from top-retrieved relevant historical records. By combining the prioritized history with the corresponding query, we train an adapter to align the output with individual user-specific preferences, eliminating the reliance on access to inherent model parameters of black-box LLMs. Both the reranker and the adapter can be decomposed into a base model with multiple user-specific heads, resembling a hydra. The base model maintains shared knowledge across users, while the multiple personal heads capture user-specific preferences. Experimental results demonstrate that \method outperforms existing state-of-the-art prompt-based methods by an average relative improvement of 9.01% across five diverse personalization tasks in the LaMP benchmark.
more » « less
Full Text Available

« Prev Next »

Search for: All records